{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "b9a4225b-1a05-4b58-9e1d-1511650ef225",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Requirement already satisfied: hnswlib in /home/ubuntu/miniconda3/envs/cornac/lib/python3.10/site-packages (0.7.0)\n",
      "Requirement already satisfied: numpy in /home/ubuntu/miniconda3/envs/cornac/lib/python3.10/site-packages (from hnswlib) (1.26.2)\n"
     ]
    }
   ],
   "source": [
    "!pip install hnswlib"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "74a9e78f-3e8a-4ee2-89fe-b3a3f4784b53",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import hnswlib\n",
    "import cornac\n",
    "from cornac.data import Reader\n",
    "from cornac.datasets import netflix\n",
    "from cornac.eval_methods import RatioSplit\n",
    "from cornac.models import MF, HNSWLibANN"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cf6bb9a5-ffb5-4221-8122-9aa286af1d9c",
   "metadata": {},
   "source": [
    "## Recommender model training\n",
    "\n",
    "The following experiment shows how to perform ANN-search within Cornac. First, we need to train a model that supports ANN search. Here we choose MF for simple illustration purpose. Other models that support ANN search should work in a similar fashion."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "76a0c130-7dd7-4004-a613-5b123dcc75d2",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "rating_threshold = 1.0\n",
      "exclude_unknowns = True\n",
      "---\n",
      "Training data:\n",
      "Number of users = 9986\n",
      "Number of items = 4921\n",
      "Number of ratings = 547022\n",
      "Max rating = 1.0\n",
      "Min rating = 1.0\n",
      "Global mean = 1.0\n",
      "---\n",
      "Test data:\n",
      "Number of users = 9986\n",
      "Number of items = 4921\n",
      "Number of ratings = 60747\n",
      "Number of unknown users = 0\n",
      "Number of unknown items = 0\n",
      "---\n",
      "Total users = 9986\n",
      "Total items = 4921\n",
      "\n",
      "[MF] Training started!\n"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "018ab8d62ba047868f24b1b138ecd40e",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "  0%|          | 0/25 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Optimization finished!\n",
      "\n",
      "[MF] Evaluation started!\n"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "916ea093c8be41deaafdb2e5cc546b54",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Ranking:   0%|          | 0/8233 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "TEST:\n",
      "...\n",
      "   |    AUC | Recall@20 | Train (s) | Test (s)\n",
      "-- + ------ + --------- + --------- + --------\n",
      "MF | 0.8530 |    0.0669 |    0.9060 |   6.3865\n",
      "\n"
     ]
    }
   ],
   "source": [
    "data = netflix.load_feedback(variant=\"small\", reader=Reader(bin_threshold=1.0))\n",
    "\n",
    "ratio_split = RatioSplit(\n",
    "    data=data,\n",
    "    test_size=0.1,\n",
    "    rating_threshold=1.0,\n",
    "    exclude_unknowns=True,\n",
    "    verbose=True,\n",
    "    seed=123,\n",
    ")\n",
    "\n",
    "mf = MF(\n",
    "    k=50, \n",
    "    max_iter=25, \n",
    "    learning_rate=0.01, \n",
    "    lambda_reg=0.02, \n",
    "    use_bias=False,\n",
    "    verbose=True,\n",
    "    seed=123,\n",
    ")\n",
    "\n",
    "auc = cornac.metrics.AUC()\n",
    "rec_20 = cornac.metrics.Recall(k=20)\n",
    "\n",
    "cornac.Experiment(\n",
    "    eval_method=ratio_split,\n",
    "    models=[mf],\n",
    "    metrics=[auc, rec_20],\n",
    "    user_based=True,\n",
    ").run()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b00b5236-1a94-43a9-bfb9-6dc4623bb33c",
   "metadata": {},
   "source": [
    "## Building index for ANN recommender\n",
    "\n",
    "After MF model is trained, we need to wrap it with an ANN recommender. We employ Cornac built-in HNSWLibANN which implements [HNSW algorithm](https://arxiv.org/abs/1603.09320) for building index and doing approximate K-nearest neighbor search. More on how to tune the hyper-parameters at https://github.com/nmslib/hnswlib and https://github.com/nmslib/hnswlib/blob/master/ALGO_PARAMS.md."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "fa280b9d-ec04-41eb-9de2-acfb67fbeb80",
   "metadata": {},
   "outputs": [],
   "source": [
    "ann = HNSWLibANN(\n",
    "    model=mf,\n",
    "    M=16,\n",
    "    ef_construction=100,\n",
    "    ef=50,\n",
    "    seed=123,\n",
    "    num_threads=-1,\n",
    ")\n",
    "ann.build_index()"
   ]
  },
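  {
   "cell_type": "markdown",
   "id": "hnswlib-direct-sketch",
   "metadata": {},
   "source": [
    "For intuition about the hyper-parameters passed above (`M`, `ef_construction`, `ef`), here is a minimal sketch that calls `hnswlib` directly on random vectors. It is not Cornac's internal code: the inner-product space, the array sizes, and names such as `item_vecs` and `user_vecs` are assumptions chosen for illustration.\n",
    "\n",
    "```python\n",
    "import hnswlib\n",
    "import numpy as np\n",
    "\n",
    "rng = np.random.RandomState(123)\n",
    "item_vecs = rng.rand(1000, 16).astype(np.float32)  # hypothetical item factors\n",
    "user_vecs = rng.rand(5, 16).astype(np.float32)  # hypothetical user factors\n",
    "\n",
    "# 'ip' ranks neighbors by inner product (assumed here to mimic MF scoring).\n",
    "index = hnswlib.Index(space='ip', dim=16)\n",
    "# M: links per node; ef_construction: candidate list size at build time.\n",
    "index.init_index(max_elements=1000, ef_construction=100, M=16, random_seed=123)\n",
    "index.add_items(item_vecs, np.arange(1000))\n",
    "# ef: candidate list size at query time; it should be at least k.\n",
    "index.set_ef(50)\n",
    "\n",
    "labels, distances = index.knn_query(user_vecs, k=20)\n",
    "print(labels.shape)  # one row of 20 item labels per query vector\n",
    "```\n",
    "\n",
    "Larger values of `M`, `ef_construction`, and `ef` generally improve recall at the cost of memory, build time, and query time respectively."
   ]
  },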
  {
   "cell_type": "markdown",
   "id": "b67fe0eb-7ca5-4ff9-87a5-a34ab612036e",
   "metadata": {},
   "source": [
    "## Time/accuracy tradeoff"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b1846494-e618-48ef-b489-45d02e0d5430",
   "metadata": {},
   "source": [
    "Here we measure the tradeoff between efficiency and accuracy. Let say we do top-20 recommendations for 10,000 users."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "46c5f959-f706-4406-85c9-a4acf2ac20c0",
   "metadata": {},
   "outputs": [],
   "source": [
    "K = 20\n",
    "N = 10000\n",
    "test_users = np.random.RandomState(123).choice(mf.user_ids, size=N)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "6730a93f-729d-4bea-b5aa-8fb9d9cef867",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CPU times: user 1min 14s, sys: 18.1 ms, total: 1min 14s\n",
      "Wall time: 1.56 s\n"
     ]
    }
   ],
   "source": [
    "%%time\n",
    "\n",
    "mf_recs = []\n",
    "for uid in test_users:\n",
    "    mf_recs.append(mf.recommend(uid, k=K))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "520da23e-79eb-4c53-9aaa-a33aaea883fc",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CPU times: user 218 ms, sys: 32 µs, total: 218 ms\n",
      "Wall time: 216 ms\n"
     ]
    }
   ],
   "source": [
    "%%time\n",
    "\n",
    "ann_recs = []\n",
    "for uid in test_users:\n",
    "    ann_recs.append(ann.recommend(uid, k=K))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e2f4f68a-c69b-4e70-b32a-81eb78d21279",
   "metadata": {},
   "source": [
    "While it took MF 4.98s to complete the task, it's only 285ms for ANN. The speed up is about 17 times. Note that our dataset contains less than 5000 items. We will see an even bigger improvement with more items and with high dimensional factors."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "d9e5dde7-8012-44e7-b407-d7140047c576",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "99.87450000000001\n"
     ]
    }
   ],
   "source": [
    "recalls = []\n",
    "for mf_rec, ann_rec in zip(mf_recs, ann_recs):\n",
    "    recalls.append(len(set(mf_rec) & set(ann_rec)) / len(mf_rec))\n",
    "print(np.mean(recalls) * 100.0)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bf472812-70c8-4146-b61c-b3c1f5c5b0a5",
   "metadata": {},
   "source": [
    "In terms of recall, we only see a small drop of less than 1% meaning recommendations are very similar between the two. While it's almost a free lunch for this case, the numbers might differ for other cases. It's always good to make sure that ANN maintains consistent recommendations with the base model."
   ]
  },
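  {
   "cell_type": "markdown",
   "id": "ef-sweep-sketch",
   "metadata": {},
   "source": [
    "How closely ANN tracks the base model depends largely on the query-time parameter `ef`. The sketch below uses `hnswlib` directly on random vectors, not the trained MF factors, and compares approximate results against an exact inner-product top-K for a few `ef` values. The data, sizes, and variable names are assumptions for illustration, and no particular recall values are implied.\n",
    "\n",
    "```python\n",
    "import hnswlib\n",
    "import numpy as np\n",
    "\n",
    "rng = np.random.RandomState(123)\n",
    "items = rng.rand(2000, 32).astype(np.float32)  # hypothetical item factors\n",
    "queries = rng.rand(100, 32).astype(np.float32)  # hypothetical user factors\n",
    "k = 20\n",
    "\n",
    "# Exact top-k by inner product, used as the reference ranking.\n",
    "exact = np.argsort(-(queries @ items.T), axis=1)[:, :k]\n",
    "\n",
    "index = hnswlib.Index(space='ip', dim=32)\n",
    "index.init_index(max_elements=2000, ef_construction=100, M=16, random_seed=123)\n",
    "index.add_items(items, np.arange(2000))\n",
    "\n",
    "for ef in (20, 50, 200):\n",
    "    index.set_ef(ef)  # larger ef explores more candidates per query\n",
    "    approx, _ = index.knn_query(queries, k=k)\n",
    "    # Same set-overlap measure as above: shared items / k, averaged over queries.\n",
    "    overlap = np.mean([\n",
    "        len(set(e.tolist()) & set(a.tolist())) / k\n",
    "        for e, a in zip(exact, approx)\n",
    "    ])\n",
    "    print(f'ef={ef}: overlap={overlap:.4f}')\n",
    "```\n",
    "\n",
    "Because the comparison is set-based, it ignores ranking order within the top-K; a rank-aware comparison would need a different measure."
   ]
  },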
  {
   "cell_type": "markdown",
   "id": "daf585e5-f84b-4dc5-99f2-d6a5bd05ede4",
   "metadata": {},
   "source": [
    "## Save/load for deployment"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "1fec7413-482b-417b-a12f-362a93aefb85",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'save_dir/HNSWLibANN/2023-12-08_19-27-58-137671.pkl'"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ann.save(\"save_dir\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "b897ead8-ad69-46d4-8903-2da65b055c54",
   "metadata": {},
   "outputs": [],
   "source": [
    "loaded_ann = HNSWLibANN.load(\"save_dir/HNSWLibANN\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "74211271-2e88-4632-b18b-c87d3dfb478b",
   "metadata": {},
   "source": [
    "Let's compare top-K recommendations for 5 random users between the original ANN and the loaded ANN. Of course they should be the same."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "3cdeef2e-184a-4f76-87da-b0010b238a49",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "np.array_equal(\n",
    "    ann.recommend_batch(test_users[:5], k=K), \n",
    "    loaded_ann.recommend_batch(test_users[:5], k=K),\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1699eac6-8b72-4dbc-87dd-94e68080bbb8",
   "metadata": {},
   "source": [
    "One more test, the loaded ANN should achieve the same recall as the original one."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "7982d439-aadd-434a-8dd3-01d462486350",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "99.87450000000001\n"
     ]
    }
   ],
   "source": [
    "loaded_ann_recs = []\n",
    "for uid in test_users:\n",
    "    loaded_ann_recs.append(loaded_ann.recommend(uid, k=K))\n",
    "    \n",
    "recalls = []\n",
    "for mf_rec, ann_rec in zip(mf_recs, loaded_ann_recs):\n",
    "    recalls.append(len(set(mf_rec) & set(ann_rec)) / len(mf_rec))\n",
    "print(np.mean(recalls) * 100.0)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "cornac",
   "language": "python",
   "name": "cornac"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
